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Abstract 

Background: The regulation of a gene depends on the binding of transcription factors to specific sites located 
in the regulatory region of the gene. The generation of these binding sites and of cooperativity between them 
are essential building blocks in the evolution of complex regulatory networks. We study a theoretical model for 
the sequence evolution of binding sites by point mutations. The approach is based on biophysical models for the 
binding of transcription factors to DNA. Hence we derive empirically grounded fitness landscapes, which enter a 
population genetics model including mutations, genetic drift, and selection. 

Results: We show that the selection for factor binding generically leads to specific correlations between nucleotide 
frequencies at different positions of a binding site. We demonstrate the possibility of rapid adaptive evolution 
generating a new binding site for a given transcription factor by point mutations. The evolutionary time required 
is estimated in terms of the neutral (background) mutation rate, the selection coefficient, and the effective 
population size. 

Conclusions: The efficiency of binding site formation is seen to depend on two joint conditions: the binding site 
motif must be short enough and the promoter region must be long enough. These constraints on promoter 
architecture are indeed seen in eukaryotic systems. Furthermore, we analyse the adaptive evolution of genetic 
switches and of signal integration through binding cooperativity between different sites. Experimental tests of 
this picture involving the statistics of polymorphisms and phylogenies of sites are discussed. 



Background 

The expression of a gene is controlled by other genes 
expressed at the same time and by external signals, 
a process called gene regulation [1] . Due to the com- 
binatorial complexity of regulation, a large number 
of functional tasks can be performed by a limited 
number of genes. Differences in gene regulation are 
believed to be a major source of diversity in higher 



eukaryotes. 

To a large extent, gene regulation is the control 
of transcription. It is accomplished by a number of 
regulatory proteins called transcription factors that 
bind to specific sites on DNA. These binding sites 
contain about 10—15 base pairs relevant for bind- 
ing and are mostly located in the cis-regulatory pro- 
moter region of a gene. A cis-regulatory region in 
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E. coli is about 300 base pairs long and contains 
a few transcription factor binding sites [2]. There 
may be two or more sites for the same factor in one 
promoter region. At the same time, the sequences 
of binding sites are fuzzy, that is, different sites for 
the same factor differ by about 20 — 30 percent of 
the bases relevant for binding [2]. This makes the 
identification of sites a difficult bioinformatics prob- 
lem [3-5]. Frequently, the simultaneous binding at 
two nearby sites is energetically favoured. This so- 
called binding cooperativity can be related to various 
functions. In a genetic switch such as the famous 
phage lambda switch in Escherichia coli [6], it pro- 
duces a sharp increase of the expression level at a cer- 
tain threshold concentration of a transcription fac- 
tor. A pair of sites for two different kinds of factors 
with cooperative binding can be a simple module for 
signal integration, leading to the expression of the 
downstream gene only when both kinds of factors 
are present simultaneously [1]. These examples are 
discussed in more detail below. Regulation in higher 
eukaryotes shares these features but is vastly more 
complicated [7]. A promoter region is typically a 
few thousand base pairs long and contains many dif- 
ferent binding sites with often complex interactions. 
At the same time, individual sites are shorter, with 
about 5-8 relevant base pairs. The sites are some- 
times organized in modules interspersed between re- 
gions containing no sites. In many known cases, the 
expression of a gene depends on the simultaneous 
presence of several factors. Well-studied examples 
of regulatory networks in eukaryotes include the sea 
urchin Strongylocentrotus purpuratussea [8] and the 
early developmental genes in Drosophila [9]. 

The sequence statistics of binding sites has been 
addressed in two recent theoretical studies [10,11]. 
Based on a model incorporating the biophysics of 
sequence-factor interaction [12, 13], a fitness land- 
scape for binding site sequences is constructed (see 
the discussion in the next section). The result- 
ing mutation-selection equilibrium is analysed using 
a mean- field quasispecies approach [14]. This ap- 
proach, which neglects the effects of genetic drift, is 
applicable in very large populations. In both stud- 
ies [10,11], fuzziness is attributed to mutational en- 
tropy as a possible reason: the single or few sequence 
states with optimal binding of the transcription fac- 
tor can be outweighed by the vastly higher number 
of sub-optimal states at some mutational distance 
from the optimal binding sequence. This effect is 
similar to the fuzziness of amino acid sequences in 



proteins discussed in [15]. 

From an evolutionary perspective, explaining 
the molecular programming of regulatory networks 
presents a striking problem. The diversification of 
higher eukaryotes, in particular, requires the efficient 
generation and alteration of regulatory binding in- 
teractions. One likely mode of evolution is gene du- 
plications with subsequent complementary losses of 
function in both copies [16,17]. However, the dif- 
ferentiation of regulation should also require com- 
plementary processes that generate new functions of 
genes as a response to specific demands. This task 
must be accomplished mainly by sequence evolution 
of regulatory DNA. There are examples of highly 
conserved regulatory sequences with a conserved 
function but binding sites can also appear, disap- 
pear, or alter their sequence even between relatively 
closely related species; see, e.g., refs. [18-22]. This 
turnover of binding sites has been argued to follow an 
approximate molecular clock in Drosophila [23]. The 
transcription factors themselves arc known to remain 
more conserved, especially if they are involved in the 
regulation of more than one gene. 

The modes of regulatory sequence evolution and 
their relative importance remain largely to be ex- 
plored. Contributions may arise from point muta- 
tions, slippage processes [24], and larger rearrange- 
ments of promoter regions [25] . The latter processes 
may lead to the shuffling of entire modules of bind- 
ing sites between different genes. In this paper, 
we are more interested in the local sequence evo- 
lution within a module, which has been argued to 
contribute most of the promoter sequence difference 
between species [26]. It is also the most promising 
starting point for a quantitative analysis of binding 
site evolution. We study a theoretical model that 
takes into account point mutations, selection, and 
genetic drift. The form of selection is inferred from 
the biophysics of the binding interactions between 
transcription factors and DNA. 

We derive the stationary distribution of binding 
sites under selection, which shows specific correla- 
tions between nucleotide frequencies at different po- 
sitions in a binding site. The non-stationary solu- 
tions of the model describe efficient adaptive path- 
ways for the molecular evolution of regulatory net- 
works by point mutations. This efficiency can be 
quantified in terms of the length of the binding motif, 
and the length of the promoter region, and the fit- 
ness landscape for factor binding, which is amenable 
to quite explicit modeling. 
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With the parameters found in natural systems, 
our model predicts that a new binding site for a given 
transcription factor can be generated by a fast series 
of adaptive substitutions, even if the expression of 
the corresponding gene bears even a modest fitness 
advantage. The evolutionary time required for site 
formation in response to a newly arising selection 
pressure is estimated in terms of the characteristic 
time scales of mutation, selection, and drift. For 
Drosophila, it may be as short as 10 5 years even for 
moderate selection pressures. However, this path- 
way is found to depend crucially on the presence of 
selection. It would be too slow under neutral evolu- 
tion, in contrast to the results of [7] , see also the re- 
cent discussion in [27]. Cooperative interactions be- 
tween binding sites can evolve adaptively on similar 
time scales, as we show for the two simple examples 
alluded to above, the genetic switch and the signal 
integration module. These results are discussed at 
the end of the paper with particular emphasis on 
possible experimental tests. 



Factor binding and selection 

The binding energy (measured in units of ksT) be- 
tween a transcription factor and its binding site 
is, to a good approximation, the sum of indepen- 
dent contributions from a small number of impor- 
tant positions of the binding site sequence, E/ksT = 
£- =1 £i, with I « 10 - 15 [28-30]. The individual 
contributions £j depend on the position i and on 
the nucleotide at at that position. There is typi- 
cally one particular nucleotide a* preferred for bind- 
ing; the sequence (a*, . . . , a|) is called the target se- 
quence. The target sequence can be inferred as the 
consensus sequence of a sufficiently large number of 
equivalent sites. The so-called energy matrix Si(a) 
has been determined experimentally for some factors 
from in vitro measurements of the binding affinity for 
each single-nucleotide mutant of the target sequence. 
Typical values for the loss in binding energy are 1-3 
ksT per single-nucleotide mismatch away from the 
target sequence. In this paper, we use the further 
approximation £j ~ e if ai — a* and e = other- 
wise, the so-called two-state model [12]. The binding 
energy of any sequence (a±, . . . ,a£) is then, up to 
an irrelevant constant, simply given by its Hamming 
distance r to the target sequence: E/ksT = er. 
(The Hamming distance is defined as the number of 
positions with a mismatch a* ^ a*.) 

It is important to note the status of this "min- 



imal model" of binding energies for the discussion 
in this paper. Both approximations underlying the 
model can be violated. Even though typical mis- 
match energies are of the same order of magnitude, 
there can be considerable differences between differ- 
ent substitutions at one position and between differ- 
ent nucleotide positions. Moreover, deviations from 
the approximate additivity of binding energies for 
the single nucleotide positions have also been ob- 
served. However, these complications do not af- 
fect the order-of-magnitude estimates for adaptive 
sequence evolution. As it will become clear, the effi- 
ciency of binding site formation depends only on the 
qualitative shape of the fitness landscapes derived 
below. In these landscapes, the regime of weakly- 
binding sequences and of strongly-binding sequences 
are separated by only a few single nucleotide substi- 
tutions. The relative magnitude of the fitness in- 
crease of these substitutions does not matter in first 
approximation. Indeed, inhomogeneities in the val- 
ues of the (a) tend to reduce the number of crucial 
steps in the adaptive process and thereby to further 
increase its speed. 

Within the two-state model, the binding prob- 
ability of the factor in thermodynamic equilibrium 
is 



1 l + exp[£(r -/>)]■ v ' 

Here s is the binding energy per nucleotide mismatch 
and ep is the chemical potential measuring the fac- 
tor concentration. Both parameters are expressed 
in units of fc^T and hence dimensionless. Appro- 
priate values for typical binding sites have been dis- 
cussed extensively in refs. [10, 13]. It is found that 
e should take values around 2, which is consistent 
with the measurements for known transcription fac- 
tors mentioned above [28-30]. The chemical po- 
tential depends on the number of transcription fac- 
tors present in the cell, on the binding probability 
to background sites elsewhere in the genome (which 
have a sequence similar to the target sequence by 
chance), and on the functional sites in the in the 
genome other than the binding site in question that 
may compete for the same protein. Binding to back- 
ground sites does not significantly reduce the binding 
to a specific functional site [13]. This leads to values 
p « (logn/)/£ w 2 — 4, given observed factor num- 
bers n/ of about 50 — 5000 [13]. Binding to other 
copies of the same functional sequence becomes only 
relevant at low factor concentrations and high num- 
ber of copies, when sites compete for factors. 
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A fitness landscape quantifies the fitness contri- 
bution F(a\, . . . , ag) of each sequence state at the 
binding site. Fitness differences arise due to differ- 
ent expression levels of the regulated gene, and these 
in turn depend on the binding of the transcription 
factors. It is only these fitness differences that enter 
the population dynamics of binding site sequences 
in the next section. Following the conceptual frame- 
work of ref. [10], we assume that the environment of 
the regulated gene can be described by a number of 
cellular states (labelled by the index a) with different 
transcription factor concentrations, i.e., with differ- 
ent chemical potentials p a . These cellular states can 
be thought of as different stages within a cell cycle. 
In each state, the fitness depends on the expression 
level of the regulated gene in a specific way. This 
expression level is determined by the binding proba- 
bility p a of the transcription factor. Assuming that 
both dependencies are linear (this is not crucial) and 
that the cellular states contribute additively to the 
overall fitness F, we obtain 

F = J2s a P a - (2) 

ot 

Here the selection coefficient s a is defined as the 
fitness difference (due to different expression of the 
downstream gene) between the cases of complete fac- 
tor binding and no binding in the state a. Such 
fitness differences can now be measured directly in 
viral systems [31]. Inserting JJJ, the fitness becomes 
a function of the Hamming distance r only. We note 
that the fitness F is measured relative to that of a 
sequence with zero binding probability in any state 
a. 

In a simple case, there are just two relevant cel- 
lular states. The on state favours expression of the 
gene, the off state disfavours it. It is then natural 
to assume selection coefficients of similar magnitude; 
here we take for simplicity s = s° n = — s off > 0. We 
then obtain a crater landscape, 



y ' 1 + cxp[e(r - p on )] 1 + exp[e(r - p oS )} ' 

(3) 

with a high-fitness rim between p oS and p on flanked 
by two sigmoid thresholds; see fig. 1(a). The generic 
features of this fitness landscape are easy to inter- 
pret: the two-state selection assumed here favors in- 
termediate binding strength (i.e., intermediate Ham- 
ming distances r) where binding occurs and the gene 
is expressed in the on state but not in the off state. 
Sequences with large Hamming distance r > p on can 



bind the factor neither in the on nor in the off state, 
while sequences with r < p g lead to binding in the 
on and the off state. Both cases lead to misregula- 
tion of the downstream gene, and hence to a lower 
fitness. We note that the key feature of these fitness 
landscapes, the sigmoid thresholds, is independent 
of the particular choices of s° n and s off . 

An even simpler fitness landscape is obtained if 
only the on state contributes significantly to selec- 
tion, i.e., if s = s on > and s oS = 0. The crater 
landscape then reduces to the mesa landscape dis- 
cussed in [10,32], 

F(r) = 1 + exp[e(r - ' (4) 

which has a high-fitness plateau of radius p and one 
sigmoid threshold; see fig. 1(b). In this case, all se- 
quences with sufficiently small Hamming distance to 
the target sequence (r < p on ) have a high fitness. 

In both cases, the parameters of the binding 
model have a simple geometric interpretation: e 
gives the slope and the p a give the positions of the 
sigmoid thresholds in the fitness landscape. Eqs. © 
and J3J are again to be understood as minimal mod- 
els of fitness landscapes for binding sites, represent- 
ing target sequence selection for a given level of bind- 
ing (p off < r < p on ) and for sufficiently strong bind- 
ing (r < p on ), respectively. Despite its simplicity, 
this type of selection model based on biophysical 
binding affinities is nontrivial from a population- 
genetic viewpoint since it leads to generic correla- 
tions between frequencies of nucleotides cij and aj 
within a site, see the Results section below. We will 
also study generalized models with correlations be- 
tween two sites generated by cooperative binding. 
On the other hand, these models neglect the context 
dependence of the binding process through cofactors 
and chromatin structure. However, they are a good 
starting point for order-of magnitude estimates of 
the adaptive evolution of binding sites. 

Mutation, selection, and genetic drift 

The rates of nucleotide point mutation show a great 
variation, ranging from p ~ 10 per site and gen- 
eration for RNA viruses to values several orders of 
magnitude lower in eukaryotes, e.g., p » 2 x 10~ 9 
in Drosophila [33]. (Here we model mutation as a 
single-parameter Markov process; we do not distin- 
guish between transitions and transversions.) The 
evolution of a sufficiently large population under mu- 
tation and selection can be described in terms of the 
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average fraction of the population with a given bind- 
ing sequence. This so-called mean-field approach ne- 
glects the fluctuations due to finite population size 
(genetic drift). It leads to the so-called quasispecies 
theory [14]. For a population of sequences at a single 
binding site, the quasispecies population equation 
can be written for the fraction n(r, t) of individuals 
at Hamming distance r from the target sequence at 
time t. Along with a generalisation for two binding 
sites, it has been analysed in detail in ref. [10]. For 
the mesa landscape, the stationary solution n s t a t(?") 
has been found exactly [32]. It depends only on the 
ratio s/fi and describes a stable polymorphic pop- 
ulation, i.e., several sequence states coexist. The 
mean-field approach is valid as long as the stochas- 
tic reproductive fluctuations are leveled out by mu- 
tations. This requires absolute population numbers 
iVYigtat (?*) S> 1/a* for all relevant r, a stringent con- 
dition on the total population size A. 

This paper is concerned with a different regime of 
population dynamics, as described by the Kimura- 
Ohta theory for finite populations evolving by 
stochastic fluctuations (genetic drift) and selec- 
tion [34-36]. According to this theory, a new mu- 
tant with a fitness difference AF relative to the pre- 
existing allele could spread to fixation in the popu- 
lation. This is a stochastic process, whose rate con- 
stant is given by 



u = fiN 



1 - cxp(-2AF) 
1 - cxp(-2AAF) 



(5) 



in a diffusion approximation valid for AF <C 1 [37]. 
Here A is the effective population size (with an addi- 
tional factor 2 for diploid populations). Eq. J5J has 
three well-known regimes. For substantially delete- 
rious mutations (JVAF< — 1), substitutions are ex- 
ponentially suppressed. Nearly neutral substitutions 
(A|AF| <C 1) occur at a rate u ~ fi approximately 
equal to the rate of mutations in an individual. For 
substantially beneficial mutations (AA.F>1), the 
substitution rate is enhanced, with u ~ 2/iAAF for 
NAF > 1. 

In this picture, a population has a monomorphic 
majority for most of the time and occasional coex- 
istence of two sequence states while a substitution 
is going on. The time of coexistence is T ~ A for 
nearly neutral and T ~ 1/AF for strongly benefi- 
cial substitutions. The picture is thus self-consistent 
for Tu <C 1, i.e., for /iA <C 1. Asymptotically, it 
describes monomorphic populations moving through 
sequence space with hopping rates u. 



Introducing an ensemble of independent popula- 
tions, this stochastic evolution takes the form of a 
Master equation. For a single binding site, we ob- 
tain 

mP(r,t) = 

(c - l){t - r + 1) w r _i, r P(r - l,t) + 

(r + l)u r+1 , r P(r + l,t)- 

[r u r , r _i + (c-l)(£- r) u rir+1 }P(r, t). (6) 

Here P(r, t) denotes the probability of finding a 
population at Hamming distance r from the tar- 
get sequence, and u r y is given by (JSJ) with AF = 
F(r') — F(r). The combinatorial coefficients arise 
since a sequence at Hamming distance r can mutate 
in (c — 1)(£ — r) different ways that increase r, and 
in r ways that decrease r, where c = 4 is the number 
of different nucleotides. The stationary distribution 
is 

P stat (r) ~ exp[S(r) + 2NF(r)]. (7) 

Here S(r) = log[(^)(c— l) r /c e ] is the mutational en- 
tropy (the log fraction of sequence states with Ham- 
ming distance r) [32] and we have used the exact 
result u r+ i^/u r ^ r+ i = e 2 ( JV - 1 ) AF . To derive Q), we 
then simply approximated A — 1 by A. The form of 
-Pstat(f) reflects the selection pressure, i.e., the scale 
s of fitness differences in the landscape F{r). For 
neutral evolution (2s A = 0), the stationary distri- 
bution 

i?tatM~£exp[S(r)] ( g ) 

is obtained from a flat distribution over all sequence 
states. For moderate selection (2sA ~ 1), -P st at(?") 
results from a nontrivial balance of stochasticity 
and selection. For strong selection (2s A 3> 1), 
-Pstat(f) takes appreciable values only at points of 
near-maximal fitness, where F(r) >F m — l/2sA. In 
this regime, the dynamics of a population consists 
of beneficial mutations only, i.e., the system moves 
uphill on its fitness landscape. 

The Master equation © and the mean-field 
quasispecies equation thus describe opposite asymp- 
totic regimes, ^A <^ 1 and fiN ^> 1, of the evolu- 
tionary dynamics. Effective population sizes show a 
large variation, from values of order 10 9 in viral sys- 
tems to A ~ 10 6 in Drosophila and A - 10 4 - 10 5 in 
vertebrates. (These numbers bear some uncertainty; 
one reason is that A varies across the genome [38].) 
We conclude that the mean-field quasispecies is well 
suited for viral systems, while eukaryotes clearly 
show a stochastic dynamics of substitutions. 
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Results and discussion 

Stationary distributions and nucleotide frequency 
correlations 

In the previous sections, we have expressed the fit- 
ness landscape and the resulting population distri- 
butions as a function of the Hamming distance r 
because it is a convenient parameterization of the 
binding energy in the two-state model. In order to 
compare this approach to standard population ge- 
netics, it is useful to recast eq. (0 for the elementary 
sequence states (a%, . . . , ai), 

PLt(r)= E ^tat(ai,...,a ; ) , (9) 

(oi,...,oj)|r 

where the sum runs over all sequence states at fixed 
r. At neutrality, the distribution over sequence 
states factorizes in the single nucleotide positions, 

l 

PS M (a 1 ,...,a l )=l[ Vo {a i ). (10) 

i=l 

In the specific case of the two-state model, t'o(ai) is 
simply a flat distribution over nucleotides but it is 
obvious how this form can be generalized to arbi- 
trary nucleotide frequencies. 

According to eq. Q , the stationary distribution 
under selection takes the form 

P s tat(ai, . . ■ , a,) = P° tat K •••,«;) exp[2iVF(r)]. 

(11) 

The salient point is that F(r) is generically a 
strongly nonlinear function of r due to the sigmoid 
dependence of the binding probability on r. An anal- 
ogous statement holds beyond the two-state approx- 
imation for the dependence of F on the binding en- 
ergy E. Hence, even if 'P s tat (ai, . . . , a;) factorizes in 
the single nucleotide positions, 7 3 s tat(«i, • • • , &0 does 
not. The selection introduces specific correlations 
between the nucleotides: the fitness differences and, 
hence, the nucleotide frequencies at one position i 
depend on all other I — 1 positions in the motif. 

Adaptive generation of a binding site 

We now apply the dynamics iJBJ to the problem of 
adaptively generating a binding site in response to a 
newly arising selection pressure. We study a case 
of strong selection (sN = 100) in the crater fit- 
ness landscape with parameters I = 10, e = 2, 
p on = 3, p oS = 1 (implying that the factor con- 
centrations differ by a factor of 50), and a case of 



moderate selection (sN = 7) in the mesa landscape 
with parameters I = 10, e = 1, p = 3.6. (The 
mesa type may be most appropriate for factors with 
multiple binding sites such as the CRP repressor in 
E. coli, where binding to an individual site is negligi- 
ble in the off state.) The fitness landscapes for both 
cases are shown in fig. l(a,b) in units of the selec- 
tion pressure s. Substantially beneficial mutations 
occur only on their sigmoid slopes, i.e., in narrow 
ranges of r. The upper boundary of this region is 
given by r s = p on + log[s7V(e £ — l)]/e, which takes 
typical values r s = 5 — 7. In fig. l(c,d), we show a 
sample history of adaptive substitutions from r = 5 
to lower values of r, which are close to the point 
r m of maximal fitness. The statistics of this adap- 
tation is governed by the ensemble P(r,t); the av- 
erage f(t) and the standard deviation Sr(t) appear 
also in fig. l(c,d). In the case of strong selection, 
the expected time of the adaptive process is readily 
estimated in terms of the uphill rates in © , 

1 Ts 1 
Ts = %lN r ^ +i r(F(r-l)-F(r)y {U) 

and takes values of a few times 1/spN. We em- 
phasize again that this simple form depends only on 
the qualitative form of the fitness landscape, namely, 
that weakly and strongly binding sequence states are 
separated only by few point mutations. The conclu- 
sions are thus largely independent of the details of 
the fitness landscape, which justifies using the two- 
state approximation. 

Can such a selective process actually happen? 
This depends on the initial state of the promoter 
region in question before the selection pressure for 
a new site sets in. The region is approximated as 
an ensemble of Li = L — £ + 1 candidate sites 
undergoing independent neutral evolution, i.e., the 
simultaneous updating of £ sites by one mutation 
is replaced by independent mutations. The length 
of the promoter region is denoted by L. At sta- 
tionarity, the Hamming distance at a random site 
then follows the distribution P s t a t(f) ~ exp[S'(r)] 
shown as empty bars in fig. l(e,f). The minimal 
distance r min in the entire region is given by the 
distribution V(r) = Q^ t (r) - Q^ t (r + 1), where 
QstatM = J2r'>r p stat{r') is the cumulative distri- 
bution for a single site. V(r) is found to be strongly 
peaked, taking appreciable values only in the range 
r m i n (£, L) ± 1 around its average. We assume se- 
lective evolution sets in as soon as at least one site 
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(a.) 



(b) 




Figure 1: (a) Crater landscape and (b) Mesa landscape (@J, as a function of the Hamming distance 
r from the target sequence (within the approximation of the two-state model). r m gives the point where 
the binding probability reaches a maximum (crater landscape), or else values close to 1 (mesa landscape). 
r s approximately indicates the onset of selection, i.e. a binding probability appreciably different from zero, 
(c) Adaptive dynamics as a function of time t measured in units of l/(2s^N) in the crater landscape at 
strong selection (sN — 100). Single history r{t) (dashed lines), ensemble average f(t) (thick solid lines) and 
width given by the standard deviation curves f(t) ± Sr(t) (thin solid lines), (d) Same as (c) in the mesa 
landscape at moderate (sN = 6.8) selection, (e) Stationary ensembles P s t a t(f*) of binding site sequences with 
in the crater landscape at strong selection (filled bars) and for neutral evolution (empty bars), (f) Same as 
(e) in the mesa landscape at moderate selection, together with the histogram of Hamming distances of CRP 
site sequences in E. coli from their consensus sequence (diamonds, from [10]). 



has a Hamming distance r < r s . This is likely to 
happen spontaneously if r s > r m i n (£, L) , leading to a 
joint condition on£, L, and r s . For r s < r m i n (£, L)— 1, 
there is a neutral waiting time before the onset of 
adaptation. Its expectation value 

1 QfaV(r s + l) 
° filers + l)P stat {r s + l) [ 1 

is calculated in the appendix. It is generically much 
larger than the adaptation time T s , rendering the 
effective generation of a new site less feasible. 



The stationary distribution P s tat (r) under selec- 
tion is given by (Q) and shown as filled bars in 
fig. l(e,f). For strong selection, it is peaked at the 
point r m of maximal fitness. For moderate selection, 
it takes appreciable values for r = — 4: the binding 
site sequences are fuzzy. Assuming that the CRP 
sites at different positions in the genome of E. coli 
have to a certain extent evolved independently, we 
can fit P s tat if) with their distance distribution (data 
taken from [10]). At the values of e and p on cho- 
sen, the two distributions fit well, see fig. 1(f). This 



7 



finding is discussed in more detail below. 

Adaptation of binding cooperativity 

The cooperative binding of transcription factors in- 
volves protein-protein interactions which may be 
specific to the DNA substrate. These interactions 
often do not require conformational changes of ei- 
ther protein involved and depend only on few spe- 
cific contact points. They result in a modest en- 
ergy gain of order 3 — 4fcsT [1]. Hence, it is a 
reasonable simplification to study the adaptive ad- 
justment of binding affinities using a simple gener- 
alisation of the two-state binding model. We de- 
fine the energies Ex/UbT = £Ti and _E 2 /fcsT = £T 2 
for the binding of a single factor and impair /fcs J 1 = 
e[n + r 2 — 2(i /£)(£ — f)) for the simultaneous binding 
of both factors. The cooperativity gain is assumed 
to result from mutations at I positions in the DNA 
sequences of the factors, which encode the amino 
acids at the protein-protein contact points. These 
mutations define a Hamming distance f = 0, . . . , £ 
from the target sequence for optimal protein-protein 
binding, and 2je/£ is the binding energy per nu- 
cleotide. Here we use the values e — 2, i — 6 and 
7=1 but the qualitative patterns shown below are 
rather robust. 

The resulting equilibrium probabilities for the 
four thermodynamic states ( ) (both factors un- 
bound), (h — ) and ( — h) (one factor bound), and 
(++) (both factors bound) are 

Q — . 

Q+- = Q — exp[-e(ri - p x )], 
Q+- = 1 — cxp[-e(r 2 - p 2 )j, 
Q++ = Q — exp[-e(n + r 2 - pi - p 2 - 27)], 

(14) 

with the normalisation q + + q |_ + q++ = 1. 

The scaled chemical potentials pi and p 2 are inde- 
pendent variables if the two sites bind to different 
kinds of factors and are equal if they bind to the 
same kind. As before, the binding probabilities de- 
termine expression levels and, therefore, the fitness. 
Here we study only pairs of sites contributing addi- 
tively to the expression level in each cellular state, 
where we have 

F = J2 S °(<1+-+<1- + + 2 <1+ + )- (15) 

a 

Other important cases include activator-repressor 
site pairs such as the famous lac operon [39] , where 
the transcription-factor induced expression level is 



proportional to g_| The stochastic dynamics of 

substitutions is straightforward to generalise; it leads 
to a Master equation like JfjJ for the joint distribu- 
tion P(r\ , r 2 , f, t) . This higher-dimensional equation 
can again be solved exactly for its steady state 



Pstat (n,r 2 ,f) ~ exp[S(r 1 )+S(r 2 )+S(r)+2NF(r 1 ,r 2 ,P)}. 

(16) 

Here we discuss two simple examples of fitness 
landscapes where binding cooperativity evolves by 
adaptation to specific functional demands. A ge- 
netic switch with a sharp expression threshold is 
favoured in a system with a single transcription 
factor having similar concentrations in its on and 
off cellular state. As can be seen from eq. <(T4"|) . 
cooperative binding can sharpen the response of 
the binding probability to variations in factor con- 
centration, q ++ ~ 1/[1 + exp(— 2ep + . . .)] versus 
p ~ 1/[1 + cxp(— ep +•■■)] as given by for indi- 
vidual binding. Figs. 2(a,c) show the fitness land- 
scape F(ri,r 2 ,7) obtained from (TH)! and (|T5)l for 
p on = 2.5, p off = 1.5, and s = s on = -s off . A simple 
signal integration module responds to two different 
factors in four different cellular states, (on, on), (on, 
off), (off, on), (off, off). Individually weak but coop- 
erative binding leads to expression of the gene only 
if both factors are present simultaneously. This case 
is favoured by a fitness function of the form ljl5|l 
with selection coefficients s — — s off .° ff = _ s on,off _ 
_ s ofT,on = s on,oa/ 2 . The resulting fitness landscape 
F(ri,r2,j) is shown in figs. 2(b,d) for chemical po- 
tentials p on = 3, p oS = 1 (for each factor). 

In both cases, a pair of sites with weaker indi- 
vidual binding (r\ , r 2 = 3 — 4) and cooperativity 
(7 = 1) is seen to have a higher fitness than an op- 
timal pair (?*i = ?* 2 = 2) without cooperativity, as 
expected. Adaptive pathways Txfi(t) and 7(f) for 
strong selection (sN — 100) are shown in fig. 2(e,f). 
Typical adaptation times T s are again a few times 
l/(spN). A closer look reveals that this fast adap- 
tation sometimes leads to a metastable local fitness 
maximum with some degree of cooperativity. Com- 
pensatory mutations (see below) are then required to 
reach the global maximum, a process that may be 
considerably slower. The fuzziness <5ri j2 (i) and S^/(t) 
observed in fig. 2(e,f) decays on the larger time scale 
of compensatory mutations, reflecting the presence 
of such metastable states. 
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t t 
Figure 2: Genetic switch (left column), signal integration module (right column). (a,b) Fitness landscape 
F(ri,r 2 ) without cooperativity (7 = 0). (c,d) Fitness landscape F(ri,r 2 ) with cooperativity (7 = 1). Next- 
nearest neighbour states (r%,r2) and (r[ = r\ ± f,r 2 = r-x ± 1) of similar fitness are linked by compensatory 
mutations if the intermediate states {r\,r' 2 ) and (r^,^) have lower fitness. (e,f) Adaptive dynamics: ensem- 
ble averages r\(t) = T2{t) and j(t) (thick lines), ensemble width given by f\(t) ± 5r\(t) (same for r 2 ) and 
7(t) ± &y{t) (thin lines); cf. fig. l(e,f). 



Conclusions 

Transcription factors and their binding sites emerge 
as a suitable starting point for quantitative studies 
of gene regulation. Binding site sequences are short 
and their sequence space is simple. Moreover, the 
link between sequence, binding affinity, and fitness is 
experimentally accessible. For a single site, the sim- 
plest examples are of the mesa [10] or of the crater 
type, see fig. l(a,b). Landscapes for a pair of sites 
with cooperative binding interactions are of a simi- 
lar kind as shown in fig. 2(a-d). They can be used to 
predict the outcome of specific single-site mutation 



experiments to a certain extent. 



Fast adaptation may generate or eliminate a new bind- 
ing site 

Despite this simplicity, the evolutionary dynamics 
of binding sites is far from trivial, since it is gov- 
erned, in the generic case, by the interplay of three 
evolutionary forces: selection, mutation, and genetic 
drift. Here we have focused on the dynamical regime 
appropriate for eukaryotes, where the evolution can 
be approximated as a stochastic process of substi- 
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tutions. We find the possibility of selective path- 
ways generating a new site in response to a newly 
arising selection pressure, starting from a neutrally 
evolved initial state and progressing by point sub- 
stitutions. Such a selective formation takes roughly 
T s w Ar/(2sfiN) generations, where Ar is the num- 
ber of adaptive substitutions required. This num- 
ber is given by the Hamming distance between the 
onset of selection and the point of optimal fitness, 
Ar = r s — r m , and takes values 2 — 3 for typical 
fitness landscapes; see fig. l(a,b). For Drosophila 
melanog aster, with /i w 2 x 10 -9 [33] and N « 10 6 , 
the resulting T s is of the order of 10 6 generations 
or 10 5 years even for sites with a relatively small 
selection coefficient s = 10~ 3 . Such selective pro- 
cesses are faster than neutral evolution by a factor of 
about 1000 and would allow for independent gener- 
ation of sites even after the split from its closest rel- 
ative Drosophila simulans about 2.5 x 10 6 years ago. 
Notice that new sites are more readily generated in 
large populations. As discussed above, generating 
a new site may also require a neutral waiting time 
T until at least one candidate site in the promoter 
region of the gene in question reaches the threshold 
distance r s from the target sequence, where selection 
sets in. For site formation to be efficient, however, 
selection must be able to set in spontaneously, i.e., 
To must not greatly exceed the adaptive time T s . 
This places a bound on the relevant length £ of the 
binding motif that can readily form in a promoter 
region of length L. Given L « 300, for example, a 
motif with I = 8 and r s — 3 could still allow for spon- 
taneous adaptive site formation. (For longer motifs, 
corresponding to groups of sites with fixed relative 
distance, this pathway would require promoter re- 
gions of much larger L.) A more general case has 
recently been treated numerically in [27], where the 
dependence of the neutral waiting time on the G/C 
ratio of the initial sequence has been investigated. 
One may speculate that this adaptive dynamics is 
indeed one of the factors influencing the length of 
regulatory modules in higher eukaryotes. 

Clearly, the present model also allows for path- 
ways of negative selection leading to the elimina- 
tion of spurious binding sites in regulatory or non- 
regulatory DNA where the binding has an adverse 
fitness effect. This is important since under neutral 
evolution, candidate sites with a distance of at most 
r s from the target sequence occur frequently on a 
genome- wide scale. A recent study has indeed found 
evidence for such negative selection from the under- 



representation of binding site motifs over the entire 
genome [40]. 



Binding sites under selection have nucleotide frequency 
correlations 

We have shown that under stationary selection the 
frequencies of nucleotides at any two positions of 
the binding sequence are correlated. For the two- 
state model, the correlations are the same for any 
pair of positions i ^ j and can be computed exactly 
from the joint distribution (HJ. We emphasize that 
these correlations refer to an ensemble of indepen- 
dently evolving (monomorphic) populations and are 
not to be confused with linkage disequilibria within 
one population. This finding limits the accuracy of 
bioinformatic weight matrices, which are often as- 
sumed to factorize in the nucleotide positions even 
in the presence of selection. 



Experimental tests: Binding site polymorphisms and 
phytogenies 

The predictions of our model lend themselves to 
a number of experimental tests. In the dynamical 
regime appropriate for eukaryotes (fiN pop- 
ulations should be monomorphic at most positions 
of their binding site sequences and polymorphic at 
a few. On the other hand, the quasispecies model 
discussed in refs. [10,11] (which assumes fj,N ^> 1) 
may be most appropriate in viral systems. The in- 
termediate regime p,N ~ 1 with frequent polymor- 
phisms and genetic drift could be realized in some 
bacterial systems and presents a challenge for the- 
ory. Thus it would be very interesting to compare 
the statistics of singlc-nucleotide polymorphisms at 
binding sites in eukaryotes, bacteria, and viruses. 
Polymorphism data are expected to contain evidence 
for adaptive evolution. However, statistical tests of 
selection must be modified for promoter sequences 
[40,41]. A recent study uses data on binding sites in 
three yeast species and deduces the rates of sequence 
evolution [42]. 

A complementary source of information are phy- 
logenies of binding sites. Trees with functional dif- 
ferences between branches contain information on 
the generation of new sites or of interactions be- 
tween sites and on the time scales involved. In a 
tree for a conserved site or group of sites with suffi- 
ciently long branches, the fuzziness of the sequences 
observed on different branches is given by the en- 
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semble P s tat introduced above. For strong selection, 
-Pstat lives on the quasi-neutral network of sequence 
states with maximal fitness, where two neighbouring 
sequence states are linked by neutral mutations or 
by pairs of compensatory mutations at two different 
positions. In the crater landscape for a single site, 
this quasi-neutral network consists of all sequences 
with a fixed distance r — r max from the target se- 
quence; see fig. 1(a). Beyond the two-state approxi- 
mation for binding energies, it will be smaller since 
only some of the positions are energetically equiva- 
lent. For a group of sites, however, quasi-neutral net- 
works can be larger since compensatory mutations 
can also take place at positions on different sites as 
shown in fig. 2(d) for the example of a signal inte- 
gration module. This is consistent with experimen- 
tal evidence that the sequence divergence between 
Drosophila melanogaster and Drosophila pseudoob- 
scura involves compensatory mutations and stabilis- 
ing selection between different binding sites [43]. 

For weaker selection, site fuzziness increases fur- 
ther since P s t a t extends beyond the sequence states 
of maximal fitness and is influenced by mutational 
entropy. As shown in fig. 1(f), one can explain in this 
way the observed fuzziness in CRP sites of E. coli. 
It would then reflect different evolutionary histories 
of independent populations, rather than sampling in 
one polymorphic population as in the quasispecies 
picture of refs. [10, 11]. (In a mean-field quasis- 
pecies, appreciable fuzziness occurs only for selec- 
tion coefficients s ~ /x, minute in other than viral 
systems.) However, the data are also compatible 
with strong selection if the selection coefficients s a , 
and hence the value of r m , vary between different 
genes. Clearly, comparing P s t a t with the distribution 
of sites in a single genome requires the assumption 
that the evolutionary histories of sites at different 
positions are at least to some extent independent. 
Future data of orthologous sites in a sufficient num- 
ber of species will be more informative. Thus, fur- 
ther experimental evidence is needed to clarify the 
role of mutational entropy in the observed fuzziness. 



Evolvability of binding sites 

The present work was aimed at obtaining some in- 
sight into the molecular mechanisms and constraints 
underlying the dynamics of complex regulatory net- 
works, thereby quantifying the notion of their evolv- 
ability. The programming of binding sites and of 
cooperative interactions between them is found to 



provide efficient modes of adaptive evolution whose 
tempo can be quantified for the case of point mu- 
tations. The formation of complicated signal inte- 
gration patterns and of multi-factor interactions in 
higher eukaryotes, however, requires generalizing our 
arguments in two ways. There are further modes 
of sequence evolution such as slippage events, in- 
sertions and deletions, large scale relocation of pro- 
moter regions, and recombination. Our ongoing 
work is aimed at quantifying their relative impor- 
tance in terms of substitution rates. Moreover, there 
are also more general fitness landscapes describing, 
e.g., binding sites interacting via the expression level 
of the regulated gene (such as activator-repressor site 
pairs) and the coupled evolution of binding sites in 
different genes. 

The rapid evolution of networks hinges upon the 
existence of adaptive pathways for these formative 
steps with a characteristic time scale T s ~ 1/(s/j.N) 
much smaller than To ~ l//i, the time scale of neu- 
tral evolution. The presence of these two time scales 
has a further interesting consequence. If the selec- 
tion pressure on an existing site ceases, that site will 
disappear on the larger time scale To. It is possible, 
therefore, that large existing networks have accumu- 
lated a considerable number of redundant regulatory 
interactions acquired by selection in their past. This 
may be one factor contributing to their robustness 
against perturbations. 

Methods - Neutral evolution of binding 
sites 

To estimate the average neutral waiting time To, we 
study the mutation dynamics in the restricted range 
r = r s + 1, ..., t, allowing mutations from r s + 1 
to r s but suppressing mutations from r s back to 
r s + 1. We evaluate the time-dependent solution 
P(r, t) of the Master equation (JSJ with the initial 
condition P(r, 0) = P s tat(?")j and the resulting cu- 
mulative probability Q(t) — J2r>r +i f^*)- The 
current across the lower boundary, J(t) = fJ>(r s + 
l)P(r s + l,i) = — dQ/dt, determines the waiting 
time for a single site, 

/>oc />oo 

T = / dff J{t) = \ dtQ(t). (17) 
Jo Jo 

This is formally solved by expanding in cigenfunc- 
tions of the mutation operator. In the case rel- 
evant here, the system remains close to equilib- 
rium since the boundary current is much smaller 
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than typical currents for r > r s . Hence, P(r,t) ~ 
P stat (r) exp(-Ai) with A = J(Q)/Q(0) = jt(r a + 
l)istat(r a + 1)/Qstat(? , s + 1)- We conclude that 
the waiting time for a single site is positive with 
probability Qstatt^s + 1) ; following a distribution 
~ exp(— At), and otherwise. The resulting ex- 
pectation value is To = Qstat(?*s + 1)/A. For L\ 
independent sites, the distribution of positive wait- 
ing times is still exponential, and To is given by 
an expression of the form (|17|l with a total bound- 
ary current J(t,Lx) = dQ Ll (t)/dt. This yields 
T = Qs t l t (r s + \)/ L\\ as given by JS). The av- 
erage waiting time (in units of l//x) becomes large 
for values of r s in the tail of the distribution V(r), 
where Q^at( r s + 1) ~ 1- This is the case for 
r s <r— (£,L)-1. 
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